843 research outputs found

Authorship attribution in Portuguese using character N-grams

Authors: Baptista Jorge; Markov Ilia; Pichardo-Lagunas Obdulia
Publication venue: Obuda University
Publication date: 01/01/2017

For the Authorship Attribution (AA) task, character n-grams are considered among the best predictive features. In the English language, it has also been shown that some types of character n-grams perform better than others. This paper tackles the AA task in Portuguese by examining the performance of different types of character n-grams, and various combinations of them. The paper also experiments with different feature representations and machine-learning algorithms. Moreover, the paper demonstrates that the performance of the character n-gram approach can be improved by fine-tuning the feature set and by appropriately selecting the length and type of character n-grams. This relatively simple and language-independent approach to the AA task outperforms both a bag-of-words baseline and other approaches, using the same corpus. Funding: Mexican Government (Conacyt) [240844, 20161958]; Mexican Government (SIP-IPN) [20171813, 20171344, 20172008]; Mexican Government (SNI); Mexican Government (COFAA-IPN).
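
To illustrate the kind of feature the abstract refers to, the following is a minimal sketch of plain character n-gram counting. It does not reproduce the paper's n-gram types, feature representations, or classifiers, none of which are specified in the abstract; the function names `char_ngrams` and `relative_frequencies` and the sample string are assumptions made only for this example.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count every overlapping character n-gram of length n in text."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def relative_frequencies(counts: Counter) -> dict:
    """Turn raw counts into relative frequencies (one possible representation)."""
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()} if total else {}


# Hypothetical sample text; spaces are kept, so n-grams may span word boundaries.
profile = char_ngrams("a casa da casa", 3)
weights = relative_frequencies(profile)
```

A profile of this kind, computed per document, could serve as the input vector for whatever classifier is being compared; varying `n` is one way to explore how n-gram length affects results.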

Crossref

Sapientia

Um filtro para palavras exóticas frequentes em Português [A filter for frequent exotic words in Portuguese]

Authors: Baptista Jorge; Faísca Luís
Publication venue: Universidade do Algarve
Publication date: 30/07/2014

The graphic forms (tokens) that make up the words of a text are often ambiguous: a single form can frequently correspond to different inflections of two or more distinct lexical entries. Some of these forms correspond to 'exotic' words, that is, words that are infrequent or have even fallen out of use. The aim of this study is to determine, from the CETEMPúblico corpus, the most frequent ambiguous forms of exotic words in Portuguese, with a view to building a filter that, during lexical analysis, eliminates the 'noise' caused by these exotic forms and thus reduces the formal ambiguity of texts, simplifying the later stages of their automatic processing.

Sapientia

Os provérbios em manuais de ensino de português língua não materna [Proverbs in textbooks for teaching Portuguese as a non-native language]

Authors: Baptista Jorge; Reis Sónia
Publication venue: Sociedade Brasileira de Computacao - SB
Publication date: 01/01/2017

Proverbs display a wide variety of structures and can serve various communicative purposes. Owing to their cultural and linguistic richness, they also lend themselves to multiple teaching goals, notably in the teaching of Portuguese as a Non-Native Language (PLNM). In this work, we investigate how proverbs are actually used in PLNM textbooks, using natural language processing (NLP) tools and resources. The results are compared with observations previously made on a corpus of Portuguese textbooks for native speakers.

Sapientia

Mapping, filtering and measuring impact of ambiguous words in Portuguese

Authors: Baptista Jorge; Faísca Luís
Publication venue: Presses Universitaires de Franche-Comté
Publication date: 30/07/2014

This paper deals with ambiguous simple words of Portuguese. The Portuguese dictionary of simple inflected words (DELAF) contains 936,215 entries, covering 889,986 different inflected forms. It is possible to obtain the full list of ambiguous inflected forms (43,126), that is, word forms belonging to different categories and/or lemmas, for example capital,A/N/N (capital). We may consider A/N/N an ambiguity class. There are 137 ambiguity classes. Each ambiguity class presents a certain level of ambiguity (Amb) that corresponds to the number of lexical entries associated with each ambiguous form (again, for class A/N/N, Amb=3). Based on this information, it is possible to map how ambiguity affects the lexicon. Using the frequency information associated with the list of tokens of a large corpus (the CETEMPÚBLICO corpus, with 200 million words), it is possible to calculate how ambiguity affects real texts. Combining the two types of information, it is possible to devise and evaluate different strategies to reduce lexical ambiguity.
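
The notions of ambiguity class and ambiguity level described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the entries and token counts are hypothetical toy data, not taken from DELAF or CETEMPÚBLICO, the tuple layout is invented for the example, and ordering the categories alphabetically to form the class label is an assumption about how a label such as A/N/N might be built.

```python
from collections import defaultdict

# Hypothetical toy entries, not taken from DELAF: (inflected form, lemma, category).
ENTRIES = [
    ("capital", "capital", "A"),
    ("capital", "capital", "N"),  # masculine noun entry
    ("capital", "capital", "N"),  # feminine noun entry
    ("casa", "casa", "N"),
    ("casa", "casar", "V"),
    ("mesa", "mesa", "N"),
]

# Hypothetical token counts standing in for corpus frequency information.
TOKEN_COUNTS = {"capital": 3, "casa": 5, "mesa": 2}


def ambiguity_classes(entries):
    """Map each ambiguous form to (class label, Amb), where Amb is its entry count."""
    by_form = defaultdict(list)
    for form, _lemma, category in entries:
        by_form[form].append(category)
    return {
        form: ("/".join(sorted(cats)), len(cats))
        for form, cats in by_form.items()
        if len(cats) > 1  # forms with a single entry are unambiguous
    }


def ambiguous_token_share(token_counts, classes):
    """Fraction of corpus tokens whose form is ambiguous in the lexicon."""
    total = sum(token_counts.values())
    ambiguous = sum(n for form, n in token_counts.items() if form in classes)
    return ambiguous / total if total else 0.0


classes = ambiguity_classes(ENTRIES)
share = ambiguous_token_share(TOKEN_COUNTS, classes)
```

With the toy data, `capital` falls in class A/N/N with Amb=3, matching the example in the abstract, and the token share shows how lexicon-level ambiguity can be weighted by frequency to estimate its effect on running text.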

Sapientia

Portuguese proverbs: types and variants

Authors: Baptista Jorge; Reis Sónia
Publication venue: Editions Tradulex
Publication date: 01/01/2016

Drawing on the methodology and previous results of Rassi et al. (2014) on the automatic identification of Brazilian Portuguese proverbs, this paper reports on an extension of that experiment, but now focused on the identification of the European Portuguese proverbs and their variants. Based on a large collection of over 56 thousand Portuguese proverbs and their variants, a database of proverb types was specifically built for natural language processing, along with the finite-state tools that allow for the identification of these strings in texts. Our aim is to make these linguistic resources and language processing tools publicly available, which will undoubtedly be deemed useful assets to other paremiologic studies.

Sapientia

Estimating lexical availability of European Portuguese proverbs

Authors: Baptista Jorge; Reis Sónia
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2017

This paper relates data on lexical availability with data on textual frequency of proverbs in European Portuguese. Each data source should provide different perspectives on the use of proverbs in the language. This should allow an empirically well-motivated selection of proverbs aiming at the development of NLP resources, specifically for applications for learning Portuguese as a Foreign Language and for the diagnosis/therapy of speech impairments/disabilities. A large database (over 114,000 proverbs and their variants) was independently classified by two annotators, according to intuitively estimated lexical availability. Next, a random, stratified sample was selected and lexical availability was then confirmed with an online survey. Frequency data was gathered from two web browsers and a large, publicly available corpus of journalistic texts. Results from the survey, the web and the corpus by and large confirm the initial intuitive classification, and a core of commonly used proverbs was defined.

Sapientia

Let's play with proverbs? NLP tools and resources for iCALL applications around proverbs for PFL

Authors: Baptista Jorge; Reis Sónia
Publication date: 01/01/2016

Proverbs are an important form of cultural expression of a society and are related to various areas of knowledge and human experience (González Rey, 2002). While linguistic elements in widespread use, proverbs are very rich structures both from a cultural and from a linguistic point of view and can therefore contribute significantly to the teaching of languages, both native and foreign (Council of Europe, 2001). However, though there are extensive collections of Portuguese proverbs with tens of thousands of forms and their variants (Reis, in preparation), their automatic identification in texts is quite difficult, given their formal variation, both lexical and syntactic (Chacoto, 1994). Nevertheless, using real examples, where proverbs are used in a natural or spontaneous discourse context, is a more natural way to learn and teach the complex conditions and communicative situations that determine the use and meaning of these expressions. On the other hand, frequency indices associated with proverbs and their variants would allow one to select the most common expressions. These are precisely the most interesting forms from the point of view of their teaching/learning and could serve as a basis for the construction of educational games, particularly for computer-assisted autonomous learning of Portuguese as a foreign language (PFL). To make this possible, it is necessary, first of all, to be able to recognize the occurrence of proverbs in texts (Rassi et al. 2014), including the instances where these expressions are presented in a truncated or creatively modified form, for example, to better suit the communicative situation or to produce new and more expressive meanings. In this paper, we present an ongoing project, which aims at the automatic identification of proverbs in texts. In this interdisciplinary study, we combine natural language processing tools with questionnaire construction techniques for teaching purposes (Hoshino and Nakagawa 2005, Correia et al. 2010). This is illustrated here with different sets of formats that can be built based on the knowledge of the form and variation of proverbs, as well as their frequency in corpora.

Sapientia

Vocatives in Portuguese: Identification and Processing

Authors: Baptista Jorge; Mamede Nuno
Publication venue: OASIcs - OpenAccess Series in Informatics. 6th Symposium on Languages, Applications and Technologies (SLATE 2017)
Publication date: 01/01/2017

This paper describes the most salient linguistic aspects of vocative constructions in Portuguese, with special reference to its European variety. Next, the paper presents the strategy followed for implementing this linguistic knowledge in a computational grammar of Portuguese, developed for the natural language processing chain STRING and using the XIP rule-based parser. Very precise and detailed linguistic descriptions can be implemented in this way.

Dagstuhl Research Online Publication Server

CONSUMO DE ENERGIA E CUSTOS DE AQUECIMENTO NA PRODUÇÃO DE FLORES E LEGUMES EM ESTUFA [Energy consumption and heating costs in greenhouse production of flowers and vegetables]

Authors: Baptista F.J.; Meneses Jorge F.
Publication venue: Ed. Colibri - CEER
Publication date: 01/01/2011

The aim is to determine energy consumption and heating costs in the production of flowers and vegetables, throughout the year, in heated plastic greenhouses located in various production areas for forced (protected) crops. In the first year, Portugal and rose production were considered. Energy consumption and heating costs were calculated for heating with diesel or natural gas, for two combinations of minimum night/day air temperatures, in modern plastic greenhouses. The study is being extended to tomato production, covering Portugal and Spain.

Repositório Científico da Universidade de Évora